ABSTRACT

National data were obtained from 9-year-old, 13-year-old,
17-year-old, and 26- through 35-year-old populations in order to
determine academic achievement in nine subject areas. For each age
population, group data were calculated and reported by region, sex,
color, parents' educational level, and size and type of community.
The application of singular value decomposition of nonsquare
matrices to these data is described, and its relationship to
principal components analysis and its data reduction value are
explained. Exploratory analyses are being conducted to determine if
the same bases occur across age levels, across time from one
assessment to its reassessment, and across subject areas. Emphasis
will remain on trying to relate the characteristics of exercises to
major differences in performance through the use of orthogonal
components. (Author/BJG)

EXPLORING NATIONAL ASSESSMENT DATA USING
SINGULAR VALUE DECOMPOSITION

Judith M. Sauls and Robert C. Larson 

National Assessment of Educational Progress 
Education Commission of the States 

Introduction to National Assessment 

The National Assessment of Educational Progress (NAEP), a project
of the Education Commission of the States, was established to
collect reliable information on the achievement of educational
outcomes by assessing the skills, knowledge, and attitudes of
America's young people. Now in its sixth year of field assessment,
National Assessment has obtained data in nine different subject
areas: Science, Citizenship, Writing, Reading, Literature, Social
Studies, Music, Mathematics, and Career and Occupational
Development. Both Science and Writing have been assessed twice. All
exercises administered by NAEP are based on important educational
goals as determined by panels of educators, scholars, and laymen.
Each subject area is assessed periodically in order to determine
growth or decline in educational attainment.

National data are obtained for four age populations: 9-year-olds,
13-year-olds, 17-year-olds, and young adults aged 26-35. These four
age levels represent the end of primary, intermediate, secondary,
and post-secondary education. Within each age, group data are
calculated and reported by region, sex, color, parents' educational
level, and size and type of community.

A national probability sample is used to identify respondents,
resulting in about 2100-2200 respondents per exercise. Each
respondent takes only a portion of the total number of exercises
administered at an age level. Scores for individuals are not
obtained. The exercise, rather than the individual, is the basic
unit of interest. For each exercise, NAEP estimates the proportion
of people at a given age who can complete the exercise
successfully. Using the national probability sample, it is possible
to estimate national percentages correct on each exercise within
certain limits of accuracy. Estimates of group performance are also
obtained for each exercise. Typically, group data are reported
relative to the national level by finding the difference between
the group and national percentages correct. This difference
indicates the relative performance of the group on a particular
exercise.

For example, the following exercise was given to 13-year-olds
during the 1969-70 assessment of Science:

    Which of the following is true of hot water as compared with
    cold water?

        It is denser.
        It is easier to see through.
        Its molecules are moving faster.
        It has more free oxygen dissolved in it.
        It has more free hydrogen dissolved in it.
        I don't know.




The estimated national percentage correct for 13-year-olds was 61%.
Group data were also calculated for four regions (Northeast,
Southeast, Central, and West), for males and females, for Blacks
and Whites, for four levels of parents' education (no high school,
some high school, graduated high school, and post high school), and
for size and type of community (extreme rural, inner city, affluent
suburb, rest of big city, urban fringe, medium city, and small
places).

For this exercise, then, the following national and group data were
obtained:

                    Region               Sex         Parents' Education
 NAT     NE     SE      C      W       M      F     NHS    SHS    GHS    PHS
  61    2.4  -10.6    5.8   -0.7     0.6   -0.4   -16.4   -9.2   -0.9    5.9

            Size and Type of Community (STOC)               Color
   ER     IC     AS    RBC     UF     MC     SP           W      B
-15.1  -19.8   11.4   -2.7   -0.4    1.0    4.4         4.5  -24.4
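
The group effects in this table are differences from the national
percentage, so adding an effect back to the national value gives the
implied group percentage. The short sketch below illustrates that
arithmetic with the figures from the table. The dictionary keys are
labels chosen here for illustration, and because the national value is
reported as a rounded 61%, the implied group percentages are only
approximate.

```python
# National percentage correct and group effects (group minus national)
# for the 13-year-old hot-water exercise, copied from the table above.
national = 61.0
effects = {
    "NE": 2.4, "SE": -10.6, "C": 5.8, "W": -0.7,           # region
    "M": 0.6, "F": -0.4,                                    # sex
    "NHS": -16.4, "SHS": -9.2, "GHS": -0.9, "PHS": 5.9,     # parents' education
    "ER": -15.1, "IC": -19.8, "AS": 11.4, "RBC": -2.7,
    "UF": -0.4, "MC": 1.0, "SP": 4.4,                       # size and type of community
    "White": 4.5, "Black": -24.4,                           # color
}

# effect = (group % correct) - (national % correct), so the implied
# group percentage is the national value plus the effect.
group_pct = {g: round(national + d, 1) for g, d in effects.items()}

for g in ("PHS", "Black"):
    print(f"{g}: effect {effects[g]:+.1f}, implied group percentage {group_pct[g]:.1f}")
```

One row of the exercise-by-group data array used later in the paper is
exactly the list of 19 effects held in `effects`.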



The Data Reduction Technique of Singular Value Decomposition 

Exercise-level data are obtained on about 100-150 exercises per age
level per subject area, resulting in a huge data base. NAEP is
currently exploring a data reduction technique, singular value
decomposition (SVD), utilizing a computer algorithm for the
decomposition of non-square matrices designed by Golub and Reinsch
(1970). Our purpose is to determine if any meaningful, underlying
orthogonal dimensions can be found to describe the relations among
exercises and among groups.

Because of NAEP's unusual data base, in which different national
samples of individuals respond to only a subset of the exercises,
some of the usual descriptive and correlational procedures are not
applicable. For example, factor analysis techniques, which attempt
to explain observed relations among numerous variables in terms of
simpler relations, are not directly applicable to NAEP's basic
percentages of correct group responses. The basic unit of interest
is not the individual but the exercise and how certain groups of
individuals perform on that exercise. The variables of interest to
NAEP are not repeated measures on the same unit but are different
classifications of the same set of respondents.

One alternative data reduction technique is the general method of
singular value decomposition, which obtains both left- and
right-hand orthonormal characteristic vectors of a non-square
matrix. NAEP is using the efficient decomposition procedure
proposed by Golub and Reinsch to factor exercise-by-group arrays,
of dimension approximately 100 x 20, whose matrix elements are the
group effects associated with each exercise. For a given age level
and subject area, this data array contains the basic information
for any one assessment. By using SVD, we gain an economy of
description, for both exercises and groups of respondents at an age
level, by obtaining orthonormal bases for the spaces spanned by the
exercise vectors (rows) and by the group vectors (columns). Thus,
we obtain at once information concerning the simple relations among
exercises and among groups.

The purpose of this paper is to describe singular value
decomposition of non-square matrices, to show its relationship to
principal components analysis, and to illustrate its data reduction
value for exploring National Assessment data.

The Method of Singular Value Decomposition

A statement of the basic theorem used in singular value
decomposition can be found in most texts on linear algebra. Horst
(1963, 1965) also gives an extensive discussion on the
interpretation of this procedure, which he calls finding the basic
structure of a matrix. The theorem states that a real $m \times n$
($m \ge n$) matrix $X$ can always be expressed as the product of
three matrices:

$$X = U \Sigma V'$$

where $U$ is an $m \times n$ orthonormal matrix such that
$U'U = I_n$, $V$ is an $n \times n$ orthonormal matrix such that
$V'V = I_n$, and $\Sigma$ is a diagonal matrix of dimension
$n \times n$. The diagonal matrix $\Sigma$ contains non-negative
elements, $s_j$ for $j = 1$ to $n$, called the singular values,
which are arranged in descending order of magnitude from upper left
to lower right. The number of non-zero singular values is equal to
the rank $r \le n$ of the matrix.

The first $r$ columns of $U$ form an orthonormal basis for the
column vectors of $X$, and the first $r$ columns of $V$ (rows of
$V'$) form an orthonormal basis for the row vectors of $X$.
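
The properties stated in the theorem can be checked numerically. The
sketch below uses NumPy's `numpy.linalg.svd` on a small synthetic
matrix. This is an assumption made for illustration: NumPy calls a
LAPACK routine, which is not necessarily the Golub and Reinsch
procedure used by NAEP, and the random matrix merely stands in for a
data array.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                        # m >= n, as in the theorem
X = rng.normal(size=(m, n))        # synthetic stand-in for a data array

# full_matrices=False returns the "thin" form: U is m x n, Vt is n x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

checks = {
    "U'U = I_n": np.allclose(U.T @ U, np.eye(n)),
    "V'V = I_n": np.allclose(V.T @ V, np.eye(n)),
    "s non-negative and descending": bool(np.all(s >= 0) and np.all(np.diff(s) <= 0)),
    "X = U Sigma V'": np.allclose(U @ np.diag(s) @ Vt, X),
    "rank = number of non-zero s_j": int(np.linalg.matrix_rank(X)) == int(np.sum(s > 1e-12)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```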

For a square, symmetric matrix, $U = V$ and the orthonormal bases
of the columns and rows are the same. For the well-known case of
$X = R$, a correlation matrix, the columns of $U$ (multiplied by
the corresponding singular value) are the principal components of
$R$, and the decomposition is referred to as the principal
components analysis of $R$. For any square symmetric matrix, the
elements of the diagonal matrix $\Sigma$ are equal to the square
roots of the characteristic roots; that is, $s_j = \sqrt{\lambda_j}$
or $s_j^2 = \lambda_j$.

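When SVD is used with National Assessment data, we begin with an
exercise-by-group data array $A$. Each element of $A$,
$\Delta p_{ij}$, is the relative performance of group $j$ on
exercise $i$. We can represent $A$ as an $m \times n$ array whose
rows are exercises and whose columns are groups:

$$A = \begin{bmatrix}
\Delta p_{11} & \cdots & \Delta p_{1n} \\
\vdots        &        & \vdots        \\
\Delta p_{m1} & \cdots & \Delta p_{mn}
\end{bmatrix},
\qquad i = 1, \dots, m \ \text{(exercises)},
\quad j = 1, \dots, n \ \text{(groups)}.$$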

Next, each column of the data array $A$ is centered about the
column mean and standardized to unit variance. This standardizing
procedure has been introduced to equalize the variability of the
group effects across exercises. It can be shown that the
variability of group effects depends on the population sizes of the
groups. Small groups show much more variability across exercises
than do the larger groups. Standardizing by columns is one way to
eliminate this variability and weight each group equally in the
decomposition.
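
A minimal sketch of this column standardization is given here, under
two assumptions: the variance is computed with divisor $m$ (the number
of exercises), which is what makes $\frac{1}{m} X'X$ a correlation
matrix, and the data are synthetic columns with deliberately unequal
spread. The function name `standardize_columns` is introduced only for
this illustration.

```python
import numpy as np

def standardize_columns(A):
    """Center each column on its mean and scale it to unit variance.

    The variance uses divisor m (the number of rows), so that X'X / m
    is the correlation matrix of the columns.
    """
    A = np.asarray(A, dtype=float)
    return (A - A.mean(axis=0)) / A.std(axis=0)   # np.std defaults to ddof=0

rng = np.random.default_rng(1)
m = 10
# Synthetic group effects: 10 exercises x 3 groups; the third column
# mimics a small group with much larger variability.
A = rng.normal(size=(m, 3)) * np.array([2.0, 5.0, 12.0])
X = standardize_columns(A)

print("column variances before:", A.var(axis=0).round(2))
print("column variances after: ", X.var(axis=0).round(2))
```

After standardization every group contributes the same variance to the
decomposition, whatever its original spread.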

Conceptually, we now have a new matrix $X$ which can be written as

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$$

where each column vector $x_j$ is $m \times 1$. Applying SVD, we
can represent $X$ as

$$X = U \Sigma V'.$$

That is, $X$ can be represented as:

$$X = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \Sigma V'$$

where each column vector $u_j$ is $m \times 1$. The first $r$
columns of $U$ form an orthonormal basis for the column vectors of
$X$.



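The total column variance of $X$ is the sum of the $n$ column
variances, or $n$. The total column variance can also be expressed
as:

$$\begin{aligned}
\frac{1}{m}\,\mathrm{trace}\,X'X
  &= \frac{1}{m}\,\mathrm{trace}\,V \Sigma^2 V' \\
  &= \frac{1}{m}\,\mathrm{trace}\,\Sigma^2 V'V \\
  &= \frac{1}{m}\,\mathrm{trace}\,\Sigma^2 \\
  &= \frac{1}{m} \sum_{j=1}^{r} s_j^2 .
\end{aligned}$$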
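
The chain of equalities can be verified on any column-standardized
matrix. The sketch below does so for synthetic data, assuming NumPy and
the divisor-$m$ variance convention; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 6
A = rng.normal(size=(m, n))                   # synthetic data array
X = (A - A.mean(axis=0)) / A.std(axis=0)      # columns: mean 0, variance 1

s = np.linalg.svd(X, compute_uv=False)        # singular values only

total_from_columns = X.var(axis=0).sum()            # sum of n unit variances
total_from_trace = np.trace(X.T @ X) / m            # (1/m) trace X'X
total_from_singular_values = (s ** 2).sum() / m     # (1/m) sum of s_j^2

print(total_from_columns, total_from_trace, total_from_singular_values)
```

All three quantities equal $n$ up to floating-point error.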

If we multiply each vector $u_j$ by its corresponding singular
value $s_j$, we obtain $s_j u_j$, a vector of elements $s_j u_{ij}$
which correspond to the usual factor loadings of exercise $i$ on
principal factor (or component) $s_j u_j$. The variance of each
vector $s_j u_j$ is

$$\frac{1}{m}\, s_j^2\, u_j' u_j = \frac{1}{m}\, s_j^2 .$$

The total variance of the set of $s_j u_j$ for $j = 1$ to $r$ is
$\frac{1}{m} \sum_{j=1}^{r} s_j^2$, which is equal to the total
column variance of $X$.

Thus, the proportion of total column variance of $X$ accounted for
by the orthogonal vector $s_j u_j$ of $X$ is $s_j^2/(mn)$.

One might note that $\frac{1}{m} X'X$ is equal to the correlation
matrix $R$ among groups and can be factored as follows:

$$R = \frac{1}{m}\,(V \Sigma U')(U \Sigma V')
    = \frac{1}{m}\, V \Sigma^2 V'.$$

The columns of $V$ are the characteristic vectors of $R$, and the
columns of $\frac{1}{\sqrt{m}} V \Sigma$ are the principal
components of $R$. The characteristic roots, $\lambda_j$, of $R$
are equal to $s_j^2/m$. That is, if we had begun with the
correlation matrix and found $R = P \Lambda P'$ ($P'P = I$), where
$P$ contains the characteristic vectors of $R$ and $\Lambda$ is a
diagonal matrix containing the characteristic roots, then the
relationship between the characteristic roots $\lambda_j$ of $R$
and the singular values $s_j$ of $X$ is:

$$\lambda_j = s_j^2 / m .$$

In both cases, the amount of variance attributable to component $j$
of $X$ and the corresponding component $j$ of $R$ is the same; both
are equal to $s_j^2/m$. The proportion of total column variance is
also the same for each component $j$ of either $X$ or $R$ and is
equal to $s_j^2/(mn)$.
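
The relationship between the two routes, SVD of $X$ and
eigendecomposition of $R$, can be illustrated as follows. This is a
sketch on synthetic data, assuming NumPy; `numpy.linalg.eigh` returns
characteristic roots in ascending order, so they are reversed to match
the descending order of the singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 5
A = rng.normal(size=(m, n))                   # synthetic data array
X = (A - A.mean(axis=0)) / A.std(axis=0)      # column-standardized

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = X.T @ X / m                               # correlation matrix among columns

lam, P = np.linalg.eigh(R)                    # ascending order
lam, P = lam[::-1], P[:, ::-1]                # descending, to match s

proportion = s ** 2 / (m * n)                 # share of total column variance

print("roots of R:   ", lam.round(4))
print("s_j^2 / m:    ", (s ** 2 / m).round(4))
print("proportions:  ", proportion.round(4))
```

The characteristic vectors in `P` agree with the columns of `V` up to
sign, which is the usual indeterminacy of both decompositions.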

In addition to the factor loadings $s_j u_j$ for the columns of
$X$, one can obtain the factor loadings for the rows of $X$ from
the $V$ matrix. $V$ can be represented as:

$$V = \begin{bmatrix} v_1 & v_2 & \cdots & v_r & v_{r+1} & \cdots & v_n \end{bmatrix}$$

If we multiply each vector $v_k$ by its corresponding singular
value $s_k$, we obtain $s_k v_k$, whose elements $s_k v_{\ell k}$
correspond to the usual factor loadings of group $\ell$ on
component $k$. The amount of variance attributable to component $k$
of $\Sigma V'$ is equal to $s_k^2/n$. The proportion of total row
variance due to component $k$ is equal to

$$\frac{\frac{1}{n}\, s_k^2}{\text{total row variance of } X} .$$

Note that the total row variance is not equal to $m$, the number of
rows, since the rows were not centered or standardized.

Application of SVD Using National Assessment Data

The application of SVD using National Assessment data has centered
on the interpretation of the column vectors of $U$, the orthogonal
basis for the column vectors of $X$. Basically, we want to
interpret the orthogonal components by (a) correlating group
performance vectors with the orthogonal components and (b)
considering the relative weighting of exercises on each component
and associating, by inspection, exercise characteristics with the
orthogonal dimensions and group performance.

To illustrate how SVD is used at National Assessment, consider the
data obtained for 17-year-olds in the first assessment of Science
(1969-70).* One hundred twenty-four exercises were administered to
several national samples of 17-year-olds. For each exercise, data
were calculated and reported for groups defined by region, sex,
color, parents' education, and size and type of community. Each
column of the 124 x 19 data array was centered and standardized,
resulting in a new matrix $X$ which was then factored using SVD.
The singular values of $\Sigma$ and the cumulative percentage of
total column variance of $X$ accounted for by the components are
presented in Table 1. The first three components of $U$ account for
54% of the total column variance of $X$.
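
The cumulative percentages in Table 1 follow from the singular values
through the proportion $s_j^2/(mn)$ with $m = 124$ and $n = 19$. The
sketch below recomputes them from the singular values as printed in
Table 1. Because those values are rounded to two decimals, the
recomputed percentages can differ from the table by about 0.01.

```python
# Singular values as printed in Table 1 (17-year-olds, Science, 1969-70).
singular_values = [27.53, 17.56, 14.44, 13.42, 12.95, 11.73, 10.91, 9.96,
                   9.60, 8.29, 8.07, 7.20, 6.52, 6.34, 3.19, 1.90, 1.49,
                   0.85, 0.52]
m, n = 124, 19          # exercises x groups

cumulative = []
running = 0.0
for s in singular_values:
    running += 100.0 * s * s / (m * n)    # percentage for this component
    cumulative.append(round(running, 2))

for j, pct in enumerate(cumulative[:3], start=1):
    print(f"components 1-{j}: {pct:.2f}% of total column variance")
```

The sum of the squared singular values is close to $mn = 2356$, as the
total-variance identity requires for a column-standardized matrix.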

One way of interpreting these components is to determine how group
performance correlates with the components of $U$. Table 2 presents
the correlations of the columns of $X$ with the first three columns
of $U$. From these correlations, it appears that the first
component is related to relative group standing. Large positive
correlations occur for groups that typically perform above the
national level of performance: the affluent suburb, Whites, and the
post high school parents' education group. Large negative
correlations occur for groups that typically perform below the
national level: the Southeast, the extreme rural, the inner city,
Blacks, and the two lowest categories of parents' educational level
(no high school and some high school). Correlations close to zero
occur for groups that typically perform close to the national
level: Central, rest of big city, and small places.

* The original data matrix of group effects, the standardized
matrix $X$, and the three matrices $U$, $\Sigma$, and $V$ found
from SVD can be obtained from the authors. The size of these arrays
prohibits their publication in this paper.
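
Correlations of the kind reported in Table 2 can be computed directly,
and for a column-standardized $X$ they also have a closed form: since
$X'U = V\Sigma$, the correlation of column $x_j$ with $u_k$ is
$s_k v_{jk}/\sqrt{m}$. The sketch below checks this on synthetic data,
assuming NumPy; it does not reproduce the NAEP figures.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 60, 5
# Synthetic, mutually correlated columns standing in for group effects.
A = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = (A - A.mean(axis=0)) / A.std(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3   # number of leading components to inspect

# Correlate every column of X with each of the first k columns of U.
full = np.corrcoef(np.hstack([X, U[:, :k]]), rowvar=False)
corr = full[:n, n:]                        # n groups x k components

# Closed form: s_k * v_jk / sqrt(m).
closed_form = (Vt.T * s)[:, :k] / np.sqrt(m)

print(corr.round(2))
```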

The second component appears to be related to male-female
differences. That is, after accounting for differences in relative
group performance, the next largest orthogonal component separates
male and female performance. For Science, males typically perform
better than females, so a male-female component seems logical.

The third component shows only two large correlations: a positive
one for Northeast and a negative one for Central. Thus, this
component appears to measure some sort of regional effect which is
orthogonal to relative group performance and male-female
differences.

After examining correlations of groups with components of $U$, we
can look at the correlations of linear combinations of groups, such
as male-female, to further clarify the interpretations of the
components. This was done by taking differences of all pairwise
columns of $X$ and correlating these resulting vectors with columns
of $U$. The highest correlations with component one of $U$ were
found for differences between high performance and low performance
groups. The highest correlation (-.96) was found for Black-post
high school, two of the most extreme groups in terms of relative
performance. High correlations were also found for Southeast-post
high school (-.92), White-Black (.91), extreme rural-post high
school (-.90), no high school-post high school and some high
school-post high school (both at -.89), and Southeast-White (-.89).
These correlations tend to confirm the interpretation of the first
component as relative group performance.

For component two, the largest correlation (.88) found was with the
male-female vector. Other high correlations were found for
male-rest of big city (.86) and female-small places (-.86). The
highest correlation (.89) with component three was for
Northeast-Central.
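
The pairwise-difference analysis can be sketched as a loop over all
pairs of columns. The example below is an illustration on synthetic
data, assuming NumPy: the four group labels are a hypothetical subset,
and the M and F columns are constructed to move in opposite directions
so that a male-female contrast dominates.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
groups = ["M", "F", "NE", "C"]           # hypothetical subset of group labels
m = 80
A = rng.normal(size=(m, len(groups)))
A[:, 1] = -A[:, 0] + 0.3 * rng.normal(size=m)   # F built to oppose M
X = (A - A.mean(axis=0)) / A.std(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def corr(a, b):
    """Pearson correlation between two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

results = []
for i, j in combinations(range(len(groups)), 2):
    diff = X[:, i] - X[:, j]             # difference of two group columns
    for k in range(2):                   # first two components of U
        r = corr(diff, U[:, k])
        results.append((abs(r), f"{groups[i]}-{groups[j]}", k + 1))

results.sort(reverse=True)               # largest absolute correlation first
best = results[0]
print(f"largest |r| = {best[0]:.2f} for {best[1]} with component {best[2]}")
```

The sign of each correlation depends on the arbitrary sign of the
corresponding column of `U`, so the sketch ranks by absolute value.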

Since there do seem to be linear combinations of groups that
correlate highly with the components of $U$, another helpful
analysis might be to find the linear combination of group vectors
that provides the maximum possible correlation with each orthogonal
component. This can be accomplished through a simple application of
multiple regression analysis where the columns of $X$ are taken as
the independent variables and the orthogonal columns of $U$ are
taken in turn as the dependent variable.

Since we know that levels of parents' education (PEd) are
surrogates for or correlated with levels of income and size and
type of community (STOC), it would be helpful if one could find
orthogonal components that separate parents' education from income
level. Again, one can take in turn the two sets of vectors for the
variables PEd and STOC as the independent sets and find the linear
combination which correlates maximally with each orthogonal
component.
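
The regression step can be sketched with ordinary least squares. The
example below assumes NumPy and synthetic data; the helper
`multiple_correlation` and the choice of column subset are hypothetical
and introduced only for illustration. One point worth noting: when all
columns of $X$ are used, the fit is exact, because
$u_k = X v_k / s_k$ for any non-zero $s_k$, so it is the restriction to
a subset of columns (for example, the groups of one variable) that is
informative.

```python
import numpy as np

def multiple_correlation(X_subset, u):
    """Correlation between u and its least-squares fit from X_subset.

    All vectors are assumed to be centered, so no intercept is fitted.
    """
    coef, *_ = np.linalg.lstsq(X_subset, u, rcond=None)
    fitted = X_subset @ coef
    return float(np.corrcoef(fitted, u)[0, 1]), coef

rng = np.random.default_rng(6)
m, n = 100, 6
A = rng.normal(size=(m, n))                   # synthetic data array
X = (A - A.mean(axis=0)) / A.std(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

subset = [0, 1, 2]    # hypothetical: columns belonging to one variable
r_subset, _ = multiple_correlation(X[:, subset], U[:, 0])
r_all, coef_all = multiple_correlation(X, U[:, 0])

print(f"multiple correlation, subset of columns: {r_subset:.3f}")
print(f"multiple correlation, all columns:       {r_all:.3f}")
```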

The next step is to try to interpret the orthogonal components in
terms of the exercise characteristics. For each component we can
obtain an ordering, from high to low, of the exercises. Then by
simple inspection it is possible to associate exercise
characteristics with this ordered set of exercises to determine if
there is a strong association between certain exercise
characteristics and the orthogonal components. If a strong
relationship exists, then one would expect exercises with common
characteristics to collect at each end of the ordered orthogonal
vector. In Science, for example, each exercise was characterized by
type of Science (physical, biological, or other), by objective,*
and by whether the correct answer might be commonly learned from a
book or from another source (book/non-book category). By examining
each of these ordered vectors, we found that none of these three
classifications appeared to be strongly associated with any of the
orthogonal components at age 17. To the extent that we have already
shown that some of these orthogonal components are strongly
associated with certain major group differences in performance, we
might conclude that these differences occur on exercises regardless
of science type, objective, or source of learning.

* The objectives used in the first assessment of Science were:
I. Know the fundamental facts and principles of science.
II. Possess the abilities and skills needed to engage in the
processes of science. III. Understand the investigative nature of
science. IV. Have attitudes about and appreciation of scientists,
science, and the consequences of science that stem from adequate
understandings.
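
The ordering-and-inspection step can be sketched as follows. The
component weights and the science-type labels here are synthetic and
assigned at random, so no association should be expected; the sketch,
which assumes NumPy, only shows the mechanics of sorting exercises by
their weight on one component and summarizing a characteristic along
that ordering.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 12                                        # number of exercises
u = rng.normal(size=m)
u /= np.linalg.norm(u)                        # stand-in for one column of U
labels = np.array(["physical", "biological", "other"])
science_type = labels[rng.integers(0, 3, size=m)]   # synthetic characteristic

order = np.argsort(u)[::-1]                   # exercise indices, high to low
for i in order:
    print(f"exercise {i:2d}  weight {u[i]:+.3f}  {science_type[i]}")

# Mean weight per category: a simple summary of whether one category
# collects at either end of the ordered vector.
means = {t: float(u[science_type == t].mean()) for t in np.unique(science_type)}
print(means)
```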

The previously mentioned characteristics are all related to content
and learning. As developers and administrators of large numbers of
exercises each year, NAEP is also concerned with the relation
between "non-content" characteristics of exercises and differences
between group performances. For example, position of the correct
response in a multiple choice exercise, reading level of the
exercise, format (multiple choice, open-ended, or multiple choice
with an "I don't know" foil), position in the package, time allowed
to respond to the item, and so on are non-content characteristics
that one usually hopes are unrelated to differences in group
performance. Thus, we can use these techniques to gain some
large-scale item-analysis information.

Conclusion 

National Assessment collects and reports much data for each
assessment area. The method of singular value decomposition is
being used to determine the underlying dimensions of that data
base. Exploratory analyses are being conducted to determine if the
same bases occur across age levels, across time from one assessment
to its reassessment, and across subject areas. Emphasis will remain
on trying to relate the characteristics of exercises to major
differences in performance through the use of orthogonal
components.

Further development of these initial analysis procedures 
is also under way and can be sketched briefly. The method of 
association of exercise characteristics with orthogonal components 
can be made objective through the use of elementary correlation 
procedures. One might construct a matrix of exercise character- 
istics and obtain its orthogonal components. Then correlate 
these components with the exercise characteristic vectors using 
procedures described for correlating performance vectors. One 
might then correlate the exercise characteristic vectors to the 
performance vectors through multiple regression or canonical 
correlation procedures. 

Another area that needs some attention is the method of
standardizing the columns of the original performance matrix to
obtain $X$. Recall that we standardized each column to unit
variance, making the total variance equal to $n$, the number of
columns. Since each of the five variables (region, sex, etc.) is
really a reclassification of the same population of 17-year-olds,
one might argue that the variables, rather than the groups within
each variable, should be given equal weight. For example,
standardizing each column to unit variance gives STOC, which has
seven groups, $7/n$ of the total column variance, while sex, with
only two groups, has a proportion of $2/n$. Thus, the STOC variable
carries three and a half times the weight of the sex variable. A
simple solution to this problem is to weight each group within a
variable by multiplying by $1/\sqrt{g}$, where $g$ is the number of
groups within a variable. Then the sum of the column or group
variances for each variable would be one, and the total column
variance for $X$ would be equal to the number of variables.
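
The proposed reweighting can be sketched as follows, assuming NumPy.
The group counts per variable are those described in the paper (four
regions, two sexes, two colors, four levels of parents' education,
seven community types); the data themselves are synthetic, and the
dictionary keys are labels chosen for this illustration.

```python
import numpy as np

groups_per_variable = {"region": 4, "sex": 2, "color": 2,
                       "parents_education": 4, "STOC": 7}
n = sum(groups_per_variable.values())         # 19 columns in total

# One weight per column: 1/sqrt(g), with g the size of the column's variable.
weights = np.concatenate(
    [np.full(g, 1.0 / np.sqrt(g)) for g in groups_per_variable.values()]
)

rng = np.random.default_rng(8)
m = 124
A = rng.normal(size=(m, n))                   # synthetic data array
X = (A - A.mean(axis=0)) / A.std(axis=0)      # unit-variance columns
Xw = X * weights                              # reweighted columns

col_var = Xw.var(axis=0)
variance_by_variable = {}
start = 0
for name, g in groups_per_variable.items():
    variance_by_variable[name] = float(col_var[start:start + g].sum())
    start += g

print(variance_by_variable)
print("total column variance:", round(float(col_var.sum()), 6))
```

Each variable now contributes a variance of one, so the total equals
the number of variables (five) rather than the number of columns.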

These are just a few of the possibilities for further analysis.
Clearly, the basis of these analyses is the flexibility and
adaptability of SVD to data sets of different types.

Table 1

Results of SVD for 17-Year-Olds,
Science (1969-70)

Component    Singular Value    Cum. % of Total Column Variance

    1            27.53                    32.18
    2            17.56                    45.27
    3            14.44                    54.11
    4            13.42                    61.75
    5            12.95                    68.87
    6            11.73                    74.72
    7            10.91                    79.77
    8             9.96                    83.98
    9             9.60                    87.89
   10             8.29                    90.80
   11             8.07                    93.56
   12             7.20                    95.77
   13             6.52                    97.57
   14             6.34                    99.27
   15             3.19                    99.71
   16             1.90                    99.86
   17             1.49                    99.95
   18             0.85                    99.98
   19             0.52                   100.00

Table 2

Correlations of Columns of X with
the First Three Columns of U

                              Components of U
Groups                        (Columns of U)
(Columns of X)              U1       U2       U3

Region         NE          .32     -.08      .72
               SE          [?]      .27?     .04
               C           .07     -.27     -.73
               W           .48      .01     -.14

Sex            M           [?]      [?]      .03?
               F           [?]      [?]     -.17

STOC           ER         -.55      .36     -.16
               IC         -.74     -.18      .05
               AS          .62     -.05      .23
               RBC         .06     -.49      .06
               UF          .34     -.35      .38
               MC          .34      .09     -.29
               SP         -.24      .47     -.26

Color          W           .83      .07     -.28
               B          -.87     -.10      .12

Parents' Ed.   NHS        -.73     -.06      .06
               SHS        -.82      .11      .08
               GHS         .17     -.16     -.32
               PHS         .91     -.04      .05

[?] marks an entry that is illegible in the source; a trailing ?
marks an entry whose sign or decimal point is unclear in the source.